Complete PISO and SIMPLE solvers on Graphics Processing Units 



Tadeusz Tomczak'', Katarzyna Zadarnowska^, Zbigniew Koza'', Macicj Matyka*^, Lukasz Mirosiaw'^''^'* 

"■Institute of Computer Science, Control and Robotics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, 

Poland 

''Faculty of Physics and Astronomy, University of Wroclaw, pi. M. Borna 9, 50-204 Wroclaw, Poland 
'^Institute of Informatics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland 
<^Vratis Ltd, Muchoborska 18, PL54424 Wroclaw, Poland. Tel. -f48 71 7073419, Fax. -h48 71 7503104 



Abstract 



-We implemented the pressure-implicit with splitting of operators (PISO) and semi-implicit method for pressure-linked 
"equations (SIMPLE) solvers of the Navier-Stokes equations on Fermi-class graphics processing units (GPUs) using the 
CUDA technology. We also introduced a new format of sparse matrices optimized for performing elementary CFD 
operations, like gradient or divergence discretization, on GPUs. We verified the validity of the implementation on 
several standard, steady and unsteady problems. Computational efficiency of the GPU implementation was examined 
by comparing its double precision run times with those of essentially the same algorithms implemented in OpenFOAM. 
The results show that a GPU (Tcsla C2070) can outperform a server-class 6-core, 12-thread CPU (Intel Xcon X5670) 
by a factor of 4.2. 
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1. Introduction 

As the silicon technology approaches subsequent physi- 
cal barriers, keeping the exponential growth rate of com- 
putational power of computers poses numerous scientific 
and technological challenges. Today, most of performance 
improvements comes from increased parallelism. However, 
since a vast majority of the existing technologies for writ- 
ing parallel applications were designed for coarse-grained 
concurrency and rely on bulk-synchronous algorithms, fur- 
ther progress requires new computer architectures, algo- 
rithms and programming models aimed at fine-grained on- 
chip parallelism A promising attempt in this direction 
is the massively parallel architecture of modern graph- 
ics processor units (GPUs). In just a few years GPUs 
evolved into versatile programmable computing devices, 
whose peak computational performance matches that of 
the most powerful supercomputers of a decade ago. For 
this reason GPUs are used as numerical accelerators on a 
vast variety of systems, from laptops to many of today's 



26l | and tomorrow's j24| petafiop supercomputers. 



Many computational fluid dynamics (CFD) algorithms 
are inherently data-parallel, and hence suitable for GPU 
acceleration. There are, however, several obstacles to 
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reach this goal. First, existing industry-level CFD al- 
gorithms and data structures were mostly developed for 
sequential or coarse-grained parallel architectures. Sec- 
ond, the massively parallel architecture of GPUs imposes 
stiff conditions on the software to exploit the hardware 
efficiently. Finally, redesigning existing applications and 
porting them to GPUs using new programming paradigms 
requires a considerable time and effort. 

In we accelerated a standard finite volume CFD 
solver (OpenFOAM) by replacing its linear system solvers 
and sparse matrix vector product (SMVP) kernels with 
the corresponding GPU implementations. A significant 
speedup was achieved only for steady state problems, and 
we attributed marginal acceleration of transient problems 
to frequent data format conversions and additional data 
transfers. Here we tackle the problem of porting of a com- 
plete finite volume solver to the GPU. We chose to imple- 
ment pressure-implicit with splitting of operators (PISO) 
and semi-implicit method for pressure-linked equations 
(SIMPLE) solvers @ and examine whether eliminating in- 
termediate data transfers through a narrow CPU-CPU bus 
and adjusting the internal data format to the needs of the 
GPU is sufficient to significantly accelerate the time to so- 
lution. We also perform a more detailed analysis of the 
conditions necessary to obtain high acceleration rates. 

2. Related work 

The literature devoted to porting CFD algorithms to 
GPUs is ample, but a large part of the research has fo- 
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cuscd on CFD solvers based on structured, regular grids. 
While this approach facilitates coalescing of device mem- 
ory accesses and leads to a GPU-accelerated code that 
was reported as several [1| , tens ^ or even hundreds [l^ 
times faster than the corresponding CPU-only solution, 
the usability of such CFD programs is limited to simple 
geometries or applications in computer graphics [lH. For 
example, Cohen and Molemaker [3| accelerated a solver for 
the Boussinesq approximation of the Navicr-Stokcs equa- 
tions on a regular three-dimensional (3D) grid, and found 
an 8-fold speedup versus the corresponding multithreaded 
Fortran code running on an 8-core dual-socket Intel Xeon 
processor. Another example is the work of Tolke and 



Krafczyk [25|, who found their GPU implementation of 
a 3D Lattice-Boltzmann method for flows through porous 
media up to two orders of magnitude faster than a cor- 
responding single-core CPU code. While reports on two- 
or even three-digit speed gains should be interpreted cau- 
tiously [l^, they show a great potential of GPU accelera- 
tors. 

Our GPU implementation works on unstructured grids, 
which is necessary to make it applicable for realistically 
complex geometries. So far, optimized GPU implemen- 
tations of CFD solvers based on unstructured grids have 
been relatively rare, mostly because of stringent require- 
ments for efficient utilization of the GPU. For example, 
Klockner et al. [l5| implemented discontinuous Galerkin 
methods over unstructured grids and found that the high- 
est acceleration is achieved for higher-order cases due to 
their larger arithmetic intensity which helps hide indirect 
addressing latencies. Alternative optimization techniques 
were recently used by Corrigan ct al. ^ to show that a 
Tcsla 10 series GPU can accelerate a finite-volume solver 
for an inviscid, compressible flow over an unstructured grid 
by almost 10 times relative to an OpenMP-parallelized 
CPU code. 

There are also several reports on accelerating existing 
CFD software with CPUs. For example, Corrigan et al. 
d, Q used a Python script to semi-automatically translate 
OpcnMP loops to GPU kernels in a large-scale CFD code 
(nearly a million lines of Fortran 77 code), FEFLO. An- 
other strategy is to port complete solvers. This approach 
was used, for example, by Elscn et al. Q, who accelerated 
the Navier Stokes Stanford University Solver (NSSUS) and 
found the speedup of over 20 times for complex geome- 
tries containing up to 1.5 million grid points. Porting of 
the MBFL02 multi-block turbulent flow solver was de- 



scribed by Phillips et al. l21[ , who found a 9- fold speedup 
over the original CPU implementation. Both FEFLO and 
MBFL02 libraries are, however, based on structured grids. 

While there arc several libraries aimed at GPU- 
acceleration of existing unstructured-grid CFD libraries, 
e.g. Cufflink ( |http: / /cuffiink-library.googlecode.comp and 
ofgpu (http:/ /www. symscape.com/gpu-0-2-openfoam), 
their design adheres to the "partial acceleration" rather 
than "full port" strategy. In particular, they typically 
use CPUs only to accelerate some well-defined, time- 



consuming and data-parallel tasks, like solving large 
sparse linear systems. This approach introduces some 
artificial overhead related to frequent CPU — GPU — >■ 
CPU data transfers as well as data format conversions. 
Therefore our aim is to go a step further and develop a 
CPU-only implementation of a CFD solver that would 
have selected features of industrial solvers: support for 
arbitrary meshes (orthogonal or nonorthogonal) and 
time-dependent boundary conditions. 

3. Implementation 

PISO and SIMPLE are standard CFD solvers for incom- 



pressible flows and we followed [IJ, |27[ in their implemen- 
tation. We used the finite volume method (FVM) to dis- 
cretize Navier-Stokes equations which are then solved iter- 
atively until convergence. This iterative procedure consists 
of a sequence of well-defined steps. For example, PISO 
starts with the solution of the momentum equation fol- 
lowed by a series of solutions of the pressure equation and 
explicit velocity corrections. The SIMPLE model works on 
similar principles, but is optimized for steady-state flows. 
Thus, at such a coarse-grain level of description, both al- 
gorithms are essentially sequential and cannot be paral- 
lelized. Fortunately, individual steps of these algorithms 
can be parallelized using GPU. 

To port the solvers to the GPU architecture, we used 
CUDA, the computer architecture and software develop- 
ment tools for modern Nvidia accelerators d, One 
of the most distinguishing features of the CUDA program- 
ming model is the hierarchical organization of the memory. 
Thus, one of major contributions of our paper is reorgani- 
zation of solver data structures to enhance the efficiency 
of internal data transfers in a GPU device. In particu- 
lar, we focused on enhancing efficiency of the solution of 
large sparse linear systems, the most time-consuming op- 
eration in the PISO and SIMPLE solvers. This nontrivial 
problem is the subject of intensive research [H , 13 , , as 
many advanced techniques, like LU-based preconditioning, 
contain large serial components. Here we focus on con- 
jugate gradient (CG) and biconjugate gradient stabilized 
(BiCGStab) iterative solvers with Jacobi preconditioning 
[l|, two methods known to be amenable to effective fine- 
grained parallelization. 

S. 1 . Data format 

Since CPUs not only execute programs in parallel, but 
also access their memory in parallel, choosing right data 
structures is of highest importance. The data processed 
in FVM-based CFD solvers come from discretization of 
space, time and fiow equations. Space is divided into a 
mesh of N cells. Cells are polyhedrons with flat faces, and 
each face belongs to exactly two polyhedrons or is a bound- 
ary face. Pressure and velocity are dcflned at ccntroids of 
the cells. Since partial differential equations are local in 
space and time, their discretization leads to nonlinear alge- 
braic equations relating the velocity and pressure at each 
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polyhedron with their values at adjacent cells only. After 
linearization, these equations reduce to a linear system 



Ax 



where x and b are vectors of length TV and A is a sparse 
matrix such that Aij 7^ if and only if cells i and j have a 
common face or i = j. The value of Aij can depend on the 
current and previous values of the pressure and velocity at 
z, j as well as on some face-specific parameters, e.g. area 
of the face. Matrix A must be assembled many times and 
then used in a linear solver. During these operations A is 
accessed in rows or columns as if in sparse matrix- vector 
and sparse transposed matrix- vector products (STMVP). 
Although the highest priority must be granted to the op- 
timization of SMVP, A must be at the same time stored 
in a way enabling a reasonably efficient implementation of 
STMVP. 

Several formats designed for efBcient implementation of 
the SMVP kernel on modern GPUs were investigated by 
Bell and Garland [H and the data format implemented in 
our SIMPLE and PISO solvers is similar to their hybrid 
format. In the original hybrid format A is split into two 
disjoint parts: A = B + C, where B is stored in the ELL 
format; whereas C is stored in the COO format. Consider 
the following example: 



B = 



I Bqq Bqi Bq3 \ 
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In ELL, this matrix would be represented in two 2D arrays: 
V and I Q, 
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In general, dimensions of both of these arrays are N x K, 
where N is the number of cells, if is a small integer, and 
the entries in I arc integers between and — 1. While 
this format gives excellent memory bandwidth when the 
matrix is accessed by rows, it is completely inadequate for 
accessing it by columns. We can, however, take advantage 
of the fact that A is structurally symmetric {Aij 7^ iff 
Aji 0) and extend the ELL format with an additional 
N X K integer array J defined indirectly as follows. For 
the sake of clarity assume that B is also structurally sym- 
metric. By definition of ELL, for any < z < A^ and 
< k < K, Vik — Bij with j ~ lik- The corresponding 
entry in the transposed matrix, Bji, is stored in V in row 
j and some column k' (0 < k' < K). The value of Jik is 
defined as k' . Evaluation of one entry of J for the exem- 
plary matrix B is illustrated in Fig. [U and its full form 
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Figure 1: A schematic of evaluation of J[l] [2] for the cxamplary 
matrix described in the text. First it is checked V[l] [2] hold -812- 
Then the indices of B\2 are transposed to get ,821 and since the 
latter is stored in V in column 0, J[l] [2] = 0. 
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To access B by rows one can simply use the ELL format, 
i.e., arrays V and I. To access a column k of B, one first 
reads the entries in row k of array I, which in this case 
identify the row numbers of nonzero entries in this column. 
These row numbers are then used as row indices into V, 
with column indices read from J. 

If B is not structurally symmetric, i.e. if B^ 7^ and 
Bji = for some then the value of Aji is stored in C 
{Cji = Aji) and Aij is stored at Vit with some < fc < A'. 
This can be indicated in the computer representation by 
assigning a negative value to Ji^. In a similar way the 
computer representation of C stores the information that 
Aij is to be found in B. 

Note that while arrays I and J facilitate column ac- 
cesses to a matrix, these accesses rely on indirect address- 
ing, which results in uncoalesced, inefficient memory trans- 
fers. The only way to improve this would be to store a 
transposed matrix besides the original one, but this would 
require not only extra storage, but also a costly matrix 
transposition to be repeated every time a new matrix is 
assembled. In contrast, arrays I and J can be initialized 
only once in the program and be shared by all data re- 
lated to the faces. Note also that arrays I and J could 
be combined into a single integer array, Q, to reduce mem- 
ory requirements of the program. For example, /.y and Jij 
can be encoded in Q.y as Nxlij+ Jij or A'x J,y + I^j, with 
trivial decoding of lij and Jij through integer division and 
remainder operations. 

While in the original hybrid format C is represented in 
the COO format, we used compressed row sparse (CRS) 
instead to further reduce the memory usage. However, this 
particular choice does not seem to influence the overall per- 
formance significantly. Most of the polyhedrons forming 
the mesh are of the same simple shape, e.g., they arc tetra- 
hedra or parallelepipeds. This means that most cells have 
the same number of neighbours, typically 4 or 6. This, 
in turn, implies that most of the entries in A can fit into 
B, with C containing only a small fraction of nonzero en- 
tries in A. In particular, C usually vanishes altogether for 
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Table 1: Basic parameters of the meshes 



Case name 




Cells 




Faces Structured? 


cavlO'' 




1 


000 




3 


300 


yes 


cav47^ 




103 


823 




318 


096 


yes 


cavlOO^ 


1 


000 


000 


3 


030 


000 


yes 


cavlSl^ 


5 


929 


741 


17 


887 


506 


yes 


cav2233 


11 


089 


567 


33 


417 


888 


yes 


tplM 




960 


000 


3 


843 


500 


yes 


IcaO.SM 




458 


861 


1 


071 


019 


no 



structured meshes. 

4. Results 

4.1. Test cases 

To validate the code and evaluate its performance we 
solved three CFD problems: a steady flow in a 3D lid- 
driven cavity [i^l, the transient Poiseuille flow in two 
dimensions, and the steady flow through the human left 
coronary artery (LCA) (see Fig. [2]a-c). The cavity was 
a cube of side length L = 0.1 m. A constant velocity 
uiid = 1 m/s was imposed at its top face and the kine- 
matic viscosity v ~ 0.01 m^/s was assumed (Reynolds 
number Rc = 10). To evaluate how the solver performance 
depends on the problem size, we varied the regular mesh 
resolution from 10^ to 223^ cells (see Tab.[T]). 

In the transient Poiseuelle flow (tplM in Tab.[T]) we de- 
fined a 2D pipe of length 0.16 m and height 0.02 m and 
assumed i/ = 3.3 x 10^^ m^/s. The mesh resolution was 
3200 X 300 cells. We assumed a time-dependent inlet ve- 
locity: 

Uin{t) = ito sin(27r/t) 

with uq = 0.01 m/s and / ~ 0.5 s^^, and the zero-pressure 
boundary condition at the outlet. In this case we ran 5000 
PISO steps with (5t = 0.1 ms. 

Geometry of the artery simulated in the IcaO.SM case 
(Tab.HI) is visualised in Fig. [2:. We used v = 3-10"^ mVs 
and assumed a constant mass flow at the inlet (Q = 
9.975 X lO"'^ kg/s). We imposed the zero-pressure bound- 
ary conditions in all the outlets and no-slip boundary con- 
ditions at artery walls. A non-uniform and non-orthogonal 
mesh built of 458 861 cells was used. To solve this case, 
we ran 350 iterations of SIMPLE and 2000 iterations of 
PISO solvers, the latter with (5t = 0.1 ms. 

4.2. Validation 

To validate our GPU implementations we compared its 
solutions for all scenarios described in Sec. 14.11 t o the re- 
sults obtained with the OpenFOAM toolkit [27| running 
on the CPU. OpenFOAM was run with the same internal 
algorithms and control parameters as the GPU code except 
for the sparse unsymmetric linear solver: in the GPU code 
we used BiCGSTAB, whereas OpenFOAM uses BiCG. Ah 
computations were done in double precision. 



Results are depicted in Fig. [2] Panels (a), (b), and (c) 
show the geometry of the test cases: cavity, Poiseuille and 
LCA, respectively. Panels (d) and (f) show the Cartesian 
components of the velocity, panel (e) presents the veloc- 
ity component parallel to pipe walls at several times, and 
panels (g), (h), and (i) show the pressures. Data for the 
cavity and LCA were computed using the SIMPLE solver, 
whereas for Poiseuille problem PISO was used. 

In general, one can see very good matching of the GPU 
and CPU results. Only a slight disagreement in the min- 
imum value of the pressure in a cavity flow was found 
(see Fig. This discrepancy is marginal (results differ 
on the fourth significant digit), and we attribute it to the 
fact that different solvers for nonsymmetric sparse systems 
were used. 

4.3. Performance evaluation 

Performance tests of the GPU code were done on the 
Tcsla C2070 graphics accelerator attached to a PC run- 
ning 64-bit Ubuntu 10.04 LTS, graphics driver v. 290.10, 
Nvidia CUDA 4.1 and gee 4.4.3. The reference CPU tests 
were performed using OpenFOAM v. 1.7 on the Intel Xeon 
X5670 processor, which is a 6-core chip with hyperthread- 
ing. For this reason we compared a single GPU accelerator 
with a single CPU processor running 12 threads. More- 
over, since our GPU code relies on a very simple Jacobi 
(diagonal) preconditioning, we also repeated all CPU tests 
using the geometric-algebraic multi-grid solver (GAMG), 
which is considered to be among the fastest solvers of the 
pressure equation available in OpenFOAM. 

4-3.1. Time to solution 

We define the solver performance as the total time to 
solution, including the time necessary to read the test case 
definition from a disk. To find it, we ran GPU and CPU 
(OpenFOAM) solvers until residuals in the pressure and 
velocity solvers dropped below a given threshold, typically 
10~^ for smaller and 10"''^ for larger cases. Depending 
on the problem size, this time varied from 1 second to 
almost 31 hours. Similarly, we define the effective GPU 
acceleration as the ratio of the GPU solver performance 
to its CPU counterpart. 

Figure|3^ shows the effective acceleration when the GPU 
and CPU solvers use essentially the same algorithms and 
control parameters, the only exception being the BiCG 
solver rather than BiCGStab used by OpenFOAM to solve 
the velocity equations. As expected, in this case the 
relative performance of the GPU and CPU implementa- 
tions depends strongly on the problem size. The GPU is 
slower than the CPU if a mesh has less than approximately 
TV = 10^ cells, and is significantly faster, up to about 4.2 
times, if N is of order of a million or more. This reflects 
the fact that the GPU is a massively parallel device that 
needs tens of thousands of threads for efficient program 
execution. One can also see that the relative CPU-CPU 
performance is similar for the PISO and SIMPLE solvers. 
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Figure 2: (Color online) Validation cases: a) lid-driven cavity (10*' cells), b) transient Poiseuille flow, c) coronary artery. The dotted (online: 
red) line is the cross-section along which the results are evaluated. Results: d,f) velocity components u, v, w (parallel to the x, y and z axis, 
respectively) along the cross-section; e) velocity u at several times (in seconds); g,h,i) pressure along the cross-section. The results obtained 
on the CPU and GPU are shown as symbols (•) and solid lines, respectively. Only part of the available CPU data are shown for clarity. 
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Figure 3: Acceleration of the GPU relative to the CPU (OpenFOAM) 
implementations of SIMPLE and PISO solvers. In all cases the lin- 
ear solvers used by the GPU were Jacobi-preconditioned BiCGStab 
and CG. The CPU implementation used BiCG and CG with Jacobi 
preconditioner (a) or GAMG with the FDIC preconditioner (b). 



Table 2: Cumulative number of the pressure solver iterations: 
Jacobi-preconditioned CG (GPU and CPU) and GAMG (CPU). 



Case 


GPU-CG 


CPU-CG 


CPU-GAMG 




SIMPLE 




cavlO^ 


2 468 


2 474 


269 


cav47^ 


122 216 


103 321 


3 699 


cavlOO^ 


635 086 


629 182 


14 721 


cavlSl^ 


1 424 531 


1 618 634 


29 191 


cav2233 


1 733 366 


1 766 484 


27 220 


lca0.5M 


157 176 


159 959 


1 704 






PISO 




cavlO^ 


104 972 


110 614 


10 507 


cav47^ 


575 153 


568 388 


16 549 


cavlOO^ 


1 255 664 


1 352 614 


25 257 


cavlSl^ 


2 726 289 


3 010 947 


43 760 


cav2233 


3 705 806 


3 716 480 


100 279 


lca0.5M 


1 770 672 


1 830 437 


9 875 


tplM 


24 117 157 


27 042 577 


34 589 



Comparison of the GPU implementations with Open- 
FOAM exploiting the GAMG pressure solver is depicted 
in Fig.[3]3. In this case the GPU can deliver a small perfor- 
mance gain (about 30%-40%) only for the cavlOO'^ case. 
The major factor affecting the relative CPU-CPU perfor- 
mance is now the structure of the problem rather than 
just its size: GAMG turns out much more efficient for 
IcaO.SM and tplM cases than for the driven cavity. This 
is directly related to the fact that for the former problems 
a Jacobi-preconditioned CG solver used in the GPU imple- 
mentation needs a large number of iterations to converge, 
see Tab. [H In the extreme case of the transient Poiseuille 
flow, tplM, the CG solver requires almost 700 times more 
iterations than GAMG. Even though GAMG iterations are 
far more complex, a GAMG-based PISO solver running 
on the CPU turns out 13 times faster than its CG-based 
counterpart running on the GPU. 



The values in the second and third column of Tab. [2] 
differ by up to about 10% even though they refer to es- 
sentially the same quantity. This reflects the fact that 
the GPU and CPU implementations are not equivalent, 
as they use different linear solvers for velocity equations. 
Moreover, the order in which floating point operations are 
executed in both architectures is different, which leads to 
architecture-specific accumulation of numerical errors and 
divergent program execution. 



4.3.2. Profiling 

Profiling large-scale GPU code can be quite problematic. 
On the one hand, CPU-oriented profilers, like oprofile, usu- 
ally find it difficult to distinguish calls to different CUDA 
functions. On the other hand, CUDA-specific profilers, 
like Nvidia Visual Profiler, are designed to provide fine- 
grained information on kernel execution (such as branch 
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Figure 4: Percentage of the time spent in CG and BiCGStab linear 
solvers during a full PISO (P) and SIMPLE (S) solver execution. 



divergence or non-coalesced device memory accesses), and 
can incur prohibitive runtime overhead. Therefore we 
used a different strategy — since each GPU kernel must be 
launched from the CPU host, we modified the CPU source 
code by inserting calls to gettimeof day system timer. 

Figure |3] shows the fraction of the total computation 
time the CPU-accelerated PISO and SIMPLE solvers 
spent in CC and BiCGStab linear solvers. As expected, 
the Jacobi-prcconditioncd CG solver turned out the most 
time-consuming part of our CFD solvers, a single factor 
limiting the overall performance in all but the smallest 
cases. Other major components of our implementation, 
including the Jacobi-preconditioned BiCGStab solver of 
sparse nonsymmctric linear systems, are negligible from 
the performance point of view. This is no surprise, as our 
implementation of the CG solver uses the simplest non- 
trivial preconditioner, characterized by a poor convergence 
rate. This, in turn, can lead to huge numbers of internal 
CG iterations listed in Tab. [2j 

As the data in Fig. [3] show that in some cases our GPU 
implementation is competitive to an algorithmically differ- 
ent, highly optimized CPU implementation, one might ask 
whether it is possible to improve the relative CPU-CPU 
performance simply by improving the GPU implementa- 
tion of the Jacobi-preconditioned CG solver. To answer 
this question, we profiled the CG solver and show the re- 
sults in Fig. [S] Six groups of elementary numerical opera- 
tions were investigated: SMVP (sparse matrix dense vec- 
tor multiplication), daxpy (scalar times vector plus vector, 
y = a ■ X + y), dot product, reduction, diagonal precon- 
ditioner, and all other operations (e.g. memory allocation 
and management). Since all five explicitly listed opera- 
tions take at least 10% of the total CG solver time, signifi- 
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Figure 5: Percentage of the time spent in the main components of 
the Jacobi-preconditioned CG solver PISO (P) and SIMPLE (S). 



Table 3: Computation time of elementary matrix assembly oper- 
ations in SIMPLE and PISO solvers. The data for the divergence, 
gradient and Laplacian arc normalized relative to the SMVP time. 
The last column shows orthogonality of the mesh. 



Case 


SMVP time 


Normalized time 


Orth. 


[us] 


div. 


grad. 


Lap. 


cavlO^ 


41 


3.6 


16.8 


2.3 


yes 


cav47^ 


139 


4.1 


18.3 


2.5 


yes 


cavlOO^ 


1 103 


2.9 


10.7 


2.4 


yes 


cavlSl^ 


6 763 


2.5 


9.1 


2.3 


yes 


cav2233 


12 620 


2.5 


9.0 


2.3 


yes 


lca0.5M 


708 


2.6 


12.1 


58.7 


no 


tplM 


775 


2.8 


10.4 


2.3 


yes 



cant improvement of the solver performance would require 
significant improvements in practically all of the solver 
components. However, as many of them (SMVP, daxpy, 
dot product) are memory-bound and are already highly 
optimized, no further significant improvement seems pos- 
sible without additional hardware support. 

4-3.3. Matrix assembly 

While the time to solution for the cases considered in 
this study turned out completely dominated by the CG 
solver, this need not be so in other cases, other implemen- 
tations, or other CFD solvers. Therefore we also profiled 
elementary steps of the linear matrix assembly stage. We 
considered three such operations: Laplacian, divergence, 
and gradient and compared their execution times with that 
of the SMVP, as all these operations are algorithmically 
similar. The results are collected in Tab. [31 On can see 
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that for orthogonal meshes the matrix assembly time is 
dominated by the gradient. This is a GPU-specific phe- 
nomenon, a consequence of the fact that in this operator 
the faces are accessed in a different order than in SMVP, 
which leads to uncoalesced memory accesses and signifi- 
cant overall performance degradation. For nonorthogonal 
meshes the Laplacian requires additional, costly correc- 
tions and dominates the matrix assembly time. 

4.3.4- Memory usage 

The largest problem we were able to run on a 6 GB 
device, cav2233, occupied 5.9 GB for the PISO and 5.7 GB 
for the SIMPLE solver, respectively. In each test case 
the device memory footprint was similar for PISO and 
SIMPLE and, for problems of size > 10^, equaled to « 540 
bytes per cell. 

5. Discussion 

Our results show that a GPU can outperform a six-core 
server-class CPU running algorithmically equivalent im- 
plementations of popular CFD solvers, SIMPLE or PISO, 
by a factor slightly exceeding 4. This value is consistent 
with the ratio 4.5 of the theoretical memory bandwidths 
of the two processors used in our test, 144 GB/s (GPU) 
and 32 GB/s (CPU). 

We show, however, that despite its huge computational 
power, the GPU is still inferior to the CPU if the latter 
uses the most efficient algorithms. This is because the 
GPU is not flexible enough to allow both direct and ef- 
ficient implementation of many procedures optimized for 
CPUs. For example, some of the most efficient precondi- 
tioners for the CG sparse linear solver are based on incom- 
plete LU decomposition, which, unfortunately, has a large 
serial component. This issue has attracted a lot of inter- 
est. Recently, Naumov showed a 2x speedup over a quad- 
core CPU in the incomplete-LU preconditioned iterative 
methods Helfenstein and Koko reported a 10-fold 
acceleration over a single-threaded CPU code, and Nvidia 
included ILUO-class preconditioners in its forthcoming cus- 
parsc 5.0 GPU library [l^. Even better results for the 
pressure solver performance can often be achieved with 
multigrid methods and the effort to produce efficient GPU 
implementation of such solvers is also intensive, see for 
example Geveler et al 
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and references therein. In par- 
ticular, Geveler et al. reported a 5x average speedup over a 
multithreaded CPU code. These achievements are mostly 
concentrated on accelerating a particular, rather narrow 
aspect of a CFD solver. By replacing in our code the 
Jacobi-peconditioned CG with one of the above-mentioned 
solutions, it should be possible to produce a fully func- 
tional CFD solver that would at least match the speed of 
standard CFD software running on CPUs, which is our 
plan for the future. 

If a faster pressure solver is used in future CFD soft- 
ware on CPUs, the role of the remaining solver compo- 
nents will increase dramatically. Moreover, since the gap 



between the computational power of processors and the 
speed of the bus connecting GPU with CPU is expected 
to be still widening, porting of complete software rather 
than accelerating only its parts will become more and more 
desirable. 



6. Conclusions 

We implemented PISO and SIMPLE solvers on CPUs 
and investigated their main properties. The implementa- 
tions are fully functional, execute completely on the GPU 
using double precision and support time-dependent bound- 
ary conditions and arbitrary meshes. To facilitate a com- 
plete GPU port, we proposed a generic data format for 
internal data storage which helps implement elementary 
CFD solver operations like SMVP, gradient or Laplacian. 
If GPU and CPU execute essentially the same algorithms, 
a GPU (Tesla C2070) can outperform a server-class, 6- 
core CPU (Intel Xeon X5670) by up to about 4.2 times. 
We also investigated how the acceleration scales with the 
problem size and estimate that the minimum problem size 
for which GPU can outperform CPU is w 500 000. Since 
our GPU implementation exploits a simple pressure solver, 
we compared our results against the CPU running SIM- 
PLE or PISO with a state-of-the art multigrid (GAMG) 
pressure solver and found that a better pressure solver is 
needed for serious CFD applications on CPUs. We also 
carried out a detailed, coarse- and fine-grained profiling 
of our solvers, finding that their implementation is close 
to optimal, which confirmed again that the only way for 
CPUs to match the efficiency of CPUs in PISO and SIM- 
PLE kernels is a better pressure solver. 
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